Preprocessing


Notebook overview

This notebook alters the features to best capture the true nature of the data, based on the insights from the previous notebook. To do this, we define a class that encapsulates all of the feature engineering logic, then wrap it in a reusable pipeline to prevent data leakage throughout the modeling process.

Tasks:

  • Apply custom feature engineering logic (FeatureEngineer) to extract meaningful patterns.
  • Encode categorical variables using one-hot encoding.
  • Scale numeric features to help stabilize logistic regression modeling.
  • Combine all preprocessing steps into a single Pipeline object.
  • Save the full pipeline with joblib so we can apply it consistently later.

Notebook outline:

  1. Reload data
  2. Validation
  3. Feature engineering pipeline
  4. Preprocessing summary
Code
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer

from imblearn.over_sampling import SMOTE
import joblib

import sys
sys.path.append('../src')
from feature_engineering import FeatureEngineer
pd.set_option('display.max_columns', None)

print("Preprocessing environment initialized.")
Preprocessing environment initialized.

1. Reload data


We reload the cleaned dataset (data_01.csv) and validate that all expected columns are present — no extras, none missing.

Code
# Load cleaned data
df = pd.read_csv('../data/processed/data_01.csv')

# Define expected column names after EDA cleanup
expected_columns = [
    'Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
    'DistanceFromHome', 'Education', 'EducationField', 'EnvironmentSatisfaction',
    'Gender', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole',
    'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate',
    'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
    'RelationshipSatisfaction', 'StockOptionLevel', 'TotalWorkingYears',
    'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany',
    'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager'
]

# Get actual columns from the loaded DataFrame
actual_columns = list(df.columns)

# Compare against expected
missing_columns = set(expected_columns) - set(actual_columns)
unexpected_columns = set(actual_columns) - set(expected_columns)

# Display results
if not missing_columns and not unexpected_columns:
    print("Column schema validation passed.")
else:
    if missing_columns:
        print("Missing columns:", missing_columns)
    if unexpected_columns:
        print("Unexpected columns:", unexpected_columns)
Column schema validation passed.

2. Validation


Before preprocessing, we run validations to ensure:
- No missing values or constant columns remain
- Data types are correct
- The target (Attrition) distribution is as expected

Code
# Check for nulls (none expected)
null_counts = df.isnull().sum()
if null_counts.any():
    print("Unexpected null values found:")
    display(null_counts[null_counts > 0])
else:
    print("No null values found.")

print("\nData types:")
display(df.dtypes)

# Identify constant columns
nunique = df.nunique()
constant_cols = nunique[nunique == 1].index.tolist()

if constant_cols:
    print(f"Constant columns detected and dropped: {constant_cols}")
    df.drop(columns=constant_cols, inplace=True)
    print(f"New shape after dropping: {df.shape}")
else:
    print("No constant columns detected.")

# Confirm target variable distribution
print("\nClass balance in 'Attrition':")
display(df['Attrition'].value_counts(normalize=True).round(3))
No null values found.

Data types:
Age                          int64
Attrition                   object
BusinessTravel              object
DailyRate                    int64
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField              object
EnvironmentSatisfaction      int64
Gender                      object
HourlyRate                   int64
JobInvolvement               int64
JobLevel                     int64
JobRole                     object
JobSatisfaction              int64
MaritalStatus               object
MonthlyIncome                int64
MonthlyRate                  int64
NumCompaniesWorked           int64
OverTime                    object
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StockOptionLevel             int64
TotalWorkingYears            int64
TrainingTimesLastYear        int64
WorkLifeBalance              int64
YearsAtCompany               int64
YearsInCurrentRole           int64
YearsSinceLastPromotion      int64
YearsWithCurrManager         int64
dtype: object
No constant columns detected.

Class balance in 'Attrition':
Attrition
No     0.839
Yes    0.161
Name: proportion, dtype: float64
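With roughly 16% positives, any train/test split should be stratified so both partitions keep the minority share. As a minimal sketch using synthetic labels that mirror the distribution above (the real split happens in the modeling notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic target mirroring the ~84/16 split observed above
y = pd.Series(['No'] * 84 + ['Yes'] * 16)
X = pd.DataFrame({'x': range(100)})

# stratify=y preserves the class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

print(y_tr.value_counts(normalize=True)['Yes'])  # 0.16
print(y_te.value_counts(normalize=True)['Yes'])  # 0.16
```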

3. Feature engineering pipeline


  • This section creates custom features to capture patterns not directly visible in the raw data. We encapsulate this logic inside a class, FeatureEngineer(), and merge it into a Pipeline to prevent data leakage and ensure consistent preprocessing across notebooks.

FeatureEngineer() and make_preprocessing_pipeline

To capture interactions between features, and to make features suitable for modeling, all feature engineering logic is placed inside the FeatureEngineer class. This avoids repeating the logic in subsequent notebooks.

Below is a breakdown of each added/modified feature:

Tenure and experience features

TenureCategory
Buckets YearsAtCompany into tenure groups:
- 0–3 yrs
- 4–6 yrs
- 7–10 yrs
- 10+ yrs
This captures key career stage segments, which may correspond to different attrition risks.

TenureGap
Calculates: YearsInCurrentRole - YearsAtCompany
Distinguishes employees who may have changed roles internally from those who stayed static, potentially indicating engagement or stagnation.

TenureRatio
Calculates: YearsInCurrentRole / YearsAtCompany
Identifies fast or slow role transitions. High ratios may indicate stagnation, while low ratios may indicate fast promotions or instability.
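Since YearsAtCompany can be 0 (the case ZeroCompanyTenureFlag captures), the ratio needs a divide-by-zero guard. One hedged way to do it (the actual class may handle this differently):

```python
import numpy as np
import pandas as pd

years_at_company = pd.Series([0, 2, 5])
years_in_role = pd.Series([0, 2, 1])

# Replace 0 with NaN before dividing, then fill the undefined ratios with 0
tenure_ratio = (years_in_role / years_at_company.replace(0, np.nan)).fillna(0.0)
print(tenure_ratio.tolist())  # [0.0, 1.0, 0.2]
```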

ZeroCompanyTenureFlag
Binary flag indicating YearsAtCompany == 0
Captures newly joined employees who may behave differently.

NewJoinerFlag
Flags employees with:
- YearsAtCompany < 2
- TotalWorkingYears > 3
These are experienced employees who recently joined, a group that may behave differently due to habits or philosophies carried over from previous jobs.

Role and work features

Overtime_JobLevel
Interaction between OverTime and JobLevel
Useful for identifying levels of staff that are potentially overworked.

Travel_Occupation
Captures the combined effect of travel frequency and job role.
Identifies roles with frequent travel, which may correlate with elevated attrition risk.

Satisfaction features

SatisfactionMean
Averages the satisfaction scores:
- EnvironmentSatisfaction
- JobSatisfaction
- RelationshipSatisfaction
Provides a general overview of employee sentiment.

SatisfactionRange
Calculates the range of the three satisfaction scores.
Captures inconsistency in perceived satisfaction, potentially indicating internal conflict or instability.

SatisfactionStability
Binary flag: 1 if all 3 satisfaction scores are equal
Identifies employees with consistent satisfaction levels across all domains.

Financial features

Log_MonthlyIncome
Applies a log transform to MonthlyIncome
Reduces skew and compresses extreme values.

Log_DistanceFromHome
Applies a log transform to DistanceFromHome
Reduces skew and compresses extreme values.

LowIncomeFlag
Binary flag for employees earning below the 25th percentile of income
Captures possible financial dissatisfaction.

Burnout risk

StressRisk
Binary flag for employees where:
- OverTime == Yes
- JobSatisfaction ≤ 2
- SatisfactionMean < 2.5
Combines workload and dissatisfaction into a high-risk signal for possible voluntary attrition.
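The real implementation lives in ../src/feature_engineering.py and is not reproduced in this notebook. As a hedged sketch (class name and details here are illustrative, not the actual source), a scikit-learn transformer covering a few of the features above could look like:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureEngineerSketch(BaseEstimator, TransformerMixin):
    """Illustrative subset of the feature logic described above."""

    def fit(self, X, y=None):
        # Learn the income threshold on the training data only,
        # so LowIncomeFlag does not leak test-set statistics
        self.income_q25_ = X['MonthlyIncome'].quantile(0.25)
        return self

    def transform(self, X):
        X = X.copy()
        # Tenure buckets
        X['TenureCategory'] = pd.cut(
            X['YearsAtCompany'], bins=[-1, 3, 6, 10, np.inf],
            labels=['0-3', '4-6', '7-10', '10+'])
        # Satisfaction aggregates
        sat = X[['EnvironmentSatisfaction', 'JobSatisfaction',
                 'RelationshipSatisfaction']]
        X['SatisfactionMean'] = sat.mean(axis=1)
        X['SatisfactionRange'] = sat.max(axis=1) - sat.min(axis=1)
        X['SatisfactionStability'] = (X['SatisfactionRange'] == 0).astype(int)
        # Financial features
        X['Log_MonthlyIncome'] = np.log1p(X['MonthlyIncome'])
        X['LowIncomeFlag'] = (X['MonthlyIncome'] < self.income_q25_).astype(int)
        # Burnout risk
        X['StressRisk'] = ((X['OverTime'] == 'Yes')
                           & (X['JobSatisfaction'] <= 2)
                           & (X['SatisfactionMean'] < 2.5)).astype(int)
        return X
```

Learning the 25th-percentile threshold in fit rather than transform is what lets the pipeline below avoid leaking test-set statistics when it is fitted on the training split only.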

Preprocessing pipeline definition

We finalize the preprocessing logic here by defining which columns to encode, scale, or pass through unchanged:

  • Categorical variables are one-hot encoded.
  • Continuous numeric features are standardized with StandardScaler.
  • Binary flags from feature engineering are passed through untouched.
  • All transformations are bundled into a ColumnTransformer, which is embedded in a reusable Pipeline.

This pipeline will be saved and applied during modeling (03_modeling.ipynb) to ensure consistent preprocessing and no data leakage.

Code
def make_preprocessing_pipeline():

    # One-hot encode categorical features
    nominal_cols = [
        'Department', 'EducationField', 'Gender', 'MaritalStatus',
        'OverTime', 'TenureCategory', 'OverTime_JobLevel', 'Travel_Occupation'
    ]

    # Standardize continuous features 
    scale_cols = [
        'Age', 'DistanceFromHome', 'HourlyRate', 'JobInvolvement', 'JobLevel',
        'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike',
        'PerformanceRating', 'StockOptionLevel', 'TotalWorkingYears',
        'TrainingTimesLastYear',
        'YearsSinceLastPromotion', 'YearsWithCurrManager',
        'TenureRatio', 'TenureGap', 'SatisfactionMean', 'SatisfactionRange',
        'PromotionPerYear', 'YearsCompany_Satisfaction',
        'Log_MonthlyIncome', 'Log_DistanceFromHome'
    ]

    # Pass through binary flags
    passthrough_cols = [
        'ZeroCompanyTenureFlag', 'NewJoinerFlag', 'LowIncomeFlag',
        'SatisfactionStability', 'StressRisk'
    ]

    # Build column transformer
    preprocessor = ColumnTransformer(transformers=[
        ('nominal', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), nominal_cols),
        ('scale', StandardScaler(), scale_cols),
        ('passthrough', 'passthrough', passthrough_cols)
    ])

    # Full pipeline
    pipeline = Pipeline(steps=[
        ('feature_engineering', FeatureEngineer()),
        ('preprocessing', preprocessor)
    ])

    return pipeline

Export pipeline

We export the preprocessing pipeline unfitted here, so that we can fit it in the next notebook on the training set only.

Code
# Drop the target and confirm the feature columns the pipeline will receive
df_clean = df.drop(columns='Attrition')
print(df_clean.columns)
pipeline = make_preprocessing_pipeline()

# Save pipeline
joblib.dump(pipeline, '../models/preprocessing_pipeline.pkl')
print("Preprocessing pipeline saved.")
Index(['Age', 'BusinessTravel', 'DailyRate', 'Department', 'DistanceFromHome',
       'Education', 'EducationField', 'EnvironmentSatisfaction', 'Gender',
       'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole',
       'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate',
       'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')
Preprocessing pipeline saved.
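A quick sketch of the intended round trip (using a temporary path and a stand-in pipeline rather than the notebook's actual objects): the loaded pipeline is still unfitted, so fitting happens on the training split in 03_modeling.ipynb.

```python
import os
import tempfile

import joblib
from sklearn.exceptions import NotFittedError
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.validation import check_is_fitted

# Stand-in pipeline; in the notebook this is make_preprocessing_pipeline()
pipeline = Pipeline([('scale', StandardScaler())])

path = os.path.join(tempfile.mkdtemp(), 'preprocessing_pipeline.pkl')
joblib.dump(pipeline, path)

# Loading returns an equivalent, still-unfitted estimator
loaded = joblib.load(path)
try:
    check_is_fitted(loaded)
    fitted = True
except NotFittedError:
    fitted = False
print(fitted)  # False: fit on the training split in the next notebook
```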

Preprocessing summary


This notebook:

  • Applies feature logic with FeatureEngineer
  • Encodes categorical features using OneHotEncoder
  • Scales numerical features with StandardScaler
  • Wraps everything into a reusable Pipeline

This exported pipeline ensures consistent preprocessing across training and evaluation.